A common strategy to deal with these issues is to build repeated models (weak learners) on the same data and combine them to form a single result.
These are called ensemble or consensus estimators/predictors.
As a general rule, ensemble learners tend to improve the results obtained with the weak learners they are made of.
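The variance-reduction effect of combining weak learners can be illustrated with a small simulation. This is a minimal sketch, not part of the original notes: each hypothetical "weak learner" predicts a true value of 10 plus independent noise, and averaging many of them shrinks the spread of the prediction by roughly \(1/\sqrt{B}\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Each hypothetical "weak learner" predicts the true value 10 plus independent noise.
def weak_predictions(B):
    return 10 + rng.normal(0, 2, size=B)

# Spread of a single learner vs. the average of B = 100 learners,
# estimated over 2000 repetitions.
single = np.array([weak_predictions(1)[0] for _ in range(2000)])
averaged = np.array([weak_predictions(100).mean() for _ in range(2000)])

print(round(single.std(), 1))    # close to the noise level, 2
print(round(averaged.std(), 2))  # close to 2 / sqrt(100) = 0.2
```

In practice the learners in an ensemble are correlated (they are fit on overlapping data), so the gain is smaller than this independent-noise idealization, but the direction of the effect is the same.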
Ensembles can be built from different types of learners, but we will focus on those built on trees:
Two questions arise here:
The bootstrap has been applied to almost any problem in Statistics.
We begin with the easiest and best known case: estimating the standard error (that is, the square root of the variance) of an estimator.
\[\begin{eqnarray*} \theta &=& E_F(X)=\theta (F) \\ \theta &=& Med(X)=\{m:P_F(X\leq m)=1/2\}= \theta (F). \end{eqnarray*}\]
\[\begin{eqnarray*} \hat{\theta}&=&\overline{X}=\int XdF_n(x)=\frac 1n\sum_{i=1}^nx_i=\theta (F_n) \\ \hat{\theta}&=&\widehat{Med}(X)=\{m:\frac{\#x_i\leq m}n=1/2\}=\theta (F_n) \end{eqnarray*}\]
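The plug-in principle behind these formulas — apply the same functional \(\theta(\cdot)\) to \(F_n\) that defines the parameter on \(F\) — can be made concrete with a small sketch on hypothetical data (the sample values below are made up for illustration):

```python
import numpy as np

x = np.array([3.1, 4.7, 2.2, 5.0, 3.8, 4.1])
n = len(x)

# theta(F_n) for the mean: integrate X against the empirical distribution,
# which puts mass 1/n on each observation.
mean_plugin = np.sum(x * (1 / n))

# theta(F_n) for the median: the smallest data point m with
# #{x_i <= m}/n >= 1/2 (the empirical-distribution definition).
xs = np.sort(x)
median_plugin = xs[int(np.ceil(n / 2)) - 1]

print(mean_plugin)    # same value as x.mean()
print(median_plugin)
```

Note that this empirical-distribution median need not agree with conventions that interpolate between the two central order statistics for even \(n\) (e.g. `np.median`); it follows the set definition given above.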
An important question when computing an estimator \(\hat \theta\) of a parameter \(\theta\) is: how precise is \(\hat \theta\) as an estimator of \(\theta\)?
\[ \sigma _{\overline{X}}=\frac{\sigma (X)}{\sqrt{n}}=\frac{\sqrt{\int [x-\int x\,dF(x)]\sp 2dF(x)}}{\sqrt{n}}=\sigma _{\overline{X}}(F) \]
then, the standard error estimator is the same functional applied on \(F_n\), that is:
\[ \hat{\sigma}_{\overline{X}}=\frac{\hat{\sigma}(X)}{\sqrt{n}}=\frac{\sqrt{1/n\sum_{i=1}^n(x_i-\overline{x})^2}}{\sqrt{n}}=\sigma _{\overline{X}}(F_n). \]
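As a quick check of this plug-in formula, here is a minimal sketch on the same kind of made-up sample: the estimate uses the variance of \(F_n\), which divides by \(n\) rather than \(n-1\).

```python
import numpy as np

x = np.array([3.1, 4.7, 2.2, 5.0, 3.8, 4.1])
n = len(x)

# sigma_hat(X): standard deviation of the empirical distribution F_n
# (divides by n, not n - 1, because it is a plug-in quantity).
sigma_hat = np.sqrt(np.mean((x - x.mean()) ** 2))

# Plug-in standard error of the sample mean.
se_mean = sigma_hat / np.sqrt(n)
print(round(se_mean, 4))
```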
The bootstrap method makes it possible to do the desired approximation: \[\hat{\sigma}_{\hat\theta} \simeq \sigma _{\hat\theta}(F_n)\] without having to know the form of \(\sigma_{\hat\theta}(F)\).
To do this, the bootstrap estimates, or directly approximates, \(\sigma_{\hat{\theta}}(F_n)\) from the sample.
The bootstrap allows us to estimate the standard error from samples of \(F_n\), that is:
the substitution of \(F\) by \(F_n\) is carried out in the sampling step (instead of when calculating \(\sigma_{\hat{\theta}}(F)\)).
Instead of doing:
\[F\stackrel{s.r.s}{\longrightarrow }{\bf X} = (X_1,X_2,\dots, X_n) \, \quad (\hat \sigma_{\hat\theta} =\underbrace{\sigma_{\hat\theta}(F_n)}_{\text{unknown}}) \]
\[ F_n\stackrel{s.r.s}{\longrightarrow }\quad {\bf X^{*}}=(X_1^{*},X_2^{*}, \dots ,X_n^{*}) \quad (\hat \sigma_{\hat\theta}= \hat \sigma_{\hat \theta}^* \simeq \sigma_{\hat \theta}^*). \]
Here, \(\sigma_{\hat \theta}^*\) is the bootstrap standard error of \(\hat \theta\) and
\(\hat \sigma_{\hat \theta}^*\) the bootstrap estimate of the standard error of \(\hat \theta\).
This means that the sampling process consists of extracting samples of size \(n\) from \(F_n\), that is:
The samples \({\bf X^*}\) obtained through this procedure are called bootstrap samples or resamples.
\[ \mbox{if }B\rightarrow\infty \mbox{ then } \hat{\sigma}_B (\hat\theta) \rightarrow \sigma_B(\hat\theta)=\sigma_{\hat\theta}(F_n). \]
The bootstrap approximation, \(\hat{\sigma}_B(\hat\theta)\), to the bootstrap SE, \(\sigma_B(\hat\theta)\), provides an estimate of \(\sigma_{\hat\theta}(F_n)\):
\[ \hat{\sigma}_B(\hat\theta)\simeq \sigma_B(\hat\theta)=\sigma_{\hat\theta}(F_n)=\hat \sigma_{\hat\theta}. \]
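The Monte Carlo approximation \(\hat{\sigma}_B(\hat\theta)\) can be sketched directly. The example below is a minimal illustration on a hypothetical exponential sample, with the median as \(\hat\theta\) (a case where no simple closed form for \(\sigma_{\hat\theta}(F)\) is available): draw \(B\) resamples of size \(n\) with replacement, evaluate the estimator on each, and take the standard deviation of the replicates.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.exponential(scale=2.0, size=50)   # hypothetical sample from F
n, B = len(x), 2000

# Draw B bootstrap samples (size n, with replacement, i.e. from F_n)
# and evaluate the estimator theta_hat = median on each resample.
theta_star = np.array([
    np.median(rng.choice(x, size=n, replace=True)) for _ in range(B)
])

# sigma_hat_B(theta_hat): standard deviation of the bootstrap replicates.
se_boot = theta_star.std()
print(round(se_boot, 3))
```

Increasing \(B\) reduces only the Monte Carlo error of the approximation; the remaining gap between \(\sigma_{\hat\theta}(F_n)\) and \(\sigma_{\hat\theta}(F)\) depends on \(n\), not on \(B\).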
From real world to bootstrap world:
\[\hat f_{bag}(x)=\frac 1B \sum_{b=1}^B \hat f^{*b}(x) \]
\[ \hat G_{bag}(x) = \arg \max_k \hat f^{k}_{bag}(x). \]
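A minimal numpy-only sketch of the regression case \(\hat f_{bag}(x)=\frac 1B \sum_b \hat f^{*b}(x)\), using made-up data and a hand-rolled one-split stump as the weak learner (in practice one would use full trees, e.g. via `rpart` or `ipred` in R):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 1-d regression data: y = sin(x) + noise.
x = rng.uniform(0, 6, size=200)
y = np.sin(x) + rng.normal(0, 0.3, size=200)

def fit_stump(x, y):
    """One-split regression tree: choose the split minimizing squared error."""
    best = None
    for s in np.quantile(x, np.linspace(0.05, 0.95, 19)):
        left, right = y[x <= s], y[x > s]
        if len(left) == 0 or len(right) == 0:
            continue
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, s, left.mean(), right.mean())
    _, s, lmean, rmean = best
    return lambda t: np.where(t <= s, lmean, rmean)

# f_bag: average the predictions of B stumps, each fit on a bootstrap resample.
B = 50
stumps = []
for _ in range(B):
    idx = rng.integers(0, len(x), size=len(x))   # sample with replacement
    stumps.append(fit_stump(x[idx], y[idx]))

def f_bag(t):
    return np.mean([f(t) for f in stumps], axis=0)

print(np.round(f_bag(np.linspace(0.5, 5.5, 6)), 2))
```

Each individual stump is a crude step function; averaging the \(B\) bootstrap-fitted stumps yields a smoother, lower-variance fit, which is exactly the point of bagging.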
Since the observations in each out-of-bag set are not used to train the corresponding model, they can be used to evaluate its performance.
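How large are these out-of-bag sets? A short sketch (illustrative only) counts, for each bootstrap sample, the fraction of observations left out; on average it is \((1-1/n)^n \approx e^{-1} \approx 36.8\%\).

```python
import numpy as np

rng = np.random.default_rng(7)
n, B = 500, 200

# For each bootstrap sample, record which observations were left out (OOB).
oob_fraction = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, size=n)   # bootstrap indices, with replacement
    in_bag = np.zeros(n, dtype=bool)
    in_bag[idx] = True
    oob_fraction[b] = (~in_bag).mean()

# Mean OOB fraction: about (1 - 1/n)^n, close to exp(-1) ~ 0.368.
print(round(oob_fraction.mean(), 3))
```

So each fitted model has roughly a third of the data available as a built-in validation set, which is what OOB error estimates exploit.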
This example relies on the well-known AmesHousing dataset on house prices in Ames, IA.
We use the libraries:

- rpart, for stratified resampling
- ipred, for bagging